Multi-type feature fusion based on graph neural network for drug-drug interaction prediction

Background: Drug-drug interactions (DDIs) are a challenging problem in drug research. Drug combination therapy is an effective way to treat diseases, but it can also cause serious side effects. DDI prediction is therefore critical in pharmacology. Recently, researchers have used deep learning techniques to predict DDIs. However, these methods consider only a single type of drug information and have shortcomings in robustness and scalability. Results: In this paper, we propose a multi-type feature fusion model based on graph neural networks (MFFGNN) for DDI prediction, which effectively fuses the topological information in molecular graphs, the interaction information between drugs and the local chemical context in SMILES sequences. In MFFGNN, to fully learn the topological information of drugs, we propose a novel feature extraction module that captures the global features of the molecular graph and the local features of each atom in it. In addition, in the multi-type feature fusion module, we use a gating mechanism in each graph convolution layer to alleviate the over-smoothing problem during information propagation. We perform extensive experiments on multiple real datasets. The results show that MFFGNN outperforms several state-of-the-art models for DDI prediction. Moreover, the cross-dataset experiments further show that MFFGNN generalizes well. Conclusions: Our proposed model efficiently integrates the information from SMILES sequences, molecular graphs and drug-drug interaction networks. We find that a multi-type feature fusion model can accurately predict DDIs and may contribute to discovering novel DDIs.

depend on in vivo and in vitro experiments. However, because such experiments are restricted to a limited environment, are small in scale, and are cumbersome and expensive, their ability to predict DDIs is greatly limited. Therefore, an efficient computational method is needed to predict DDIs.
In the past several years, researchers have proposed methods based on machine learning [1][2][3][4] to solve this problem. Qiu et al. [5] summarized some of these methods. Deng et al. [6] used chemical structure to learn the representation of DDIs in a representation module, and then predicted some rare events with few examples in a comparing module. Deng et al. [7] predicted DDIs using different drug features and deep neural networks (DNN). Zhang et al. [8] predicted DDIs using manifold regularization.
Recently, graph-based representation learning has been applied to drug-drug interaction prediction. Drugs are compounds, each of which can be represented by a molecular graph with atoms as nodes and chemical bonds as edges, or by a Simplified Molecular Input Line Entry System (SMILES) sequence. In drug-drug interaction networks, by treating each drug as a node and each interaction as an edge, DDI prediction can be regarded as a link prediction task. Graph neural networks (GNNs) have made some progress in DDI prediction [9][10][11][12][13]. Feng et al. [14] predicted DDIs using a Graph Convolutional Network (GCN) and a DNN. In addition, there are also many methods for multi-type DDI prediction [15][16][17]. Nyamabo et al. [18] proposed to predict DDIs from the interactions between drug substructures. Then, Nyamabo et al. [19] used gating devices to learn the chemical substructures of drugs. Chen et al. [20] used a bi-level cross strategy to fuse the structural information and knowledge graph information of drugs.
Although the models mentioned above have achieved significant results, they still have some limitations: (i) They are generally limited to considering only the structure, sequence or interaction information of the drugs, without considering the synergistic effects between them. (ii) For molecular graphs, applying only a GNN can extract the local features of the atoms, but it is difficult to propagate information over long distances in the graph to capture the global features of the molecular graph. (iii) In drug-drug interaction networks, node features obtained by stacking multiple GNN layers become smoothed and blurred, which loses the diversity of node features.
To address the above issues, this paper proposes an end-to-end learning framework for DDI prediction, namely MFFGNN. In MFFGNN, we first utilize deep neural networks to capture the intra-drug features from SMILES sequences and molecular graphs. For SMILES sequences, MFFGNN applies the bi-directional gated recurrent unit neural network [21] to extract local chemical context information from the sequences. For molecular graphs, MFFGNN utilizes both graph interaction networks [22] and the graph warp unit [23] to extract the global features of the molecular graph and the local features of each atom. In addition, MFFGNN takes the intra-drug features as the initial features of the nodes in the DDI network and uses a GCN encoder to fuse the intra-drug features and external DDI features to update the drug representation. Finally, we predict the missing interactions in the DDI graph through a Multi-layer Perceptron (MLP).
Overall, the main contributions of this paper are summarized as follows:
• We propose a novel model MFFGNN for DDI prediction, which fuses the topological information in molecular graphs, the interaction information between drugs and the local chemical context in SMILES sequences.
• To better learn the topological structure of drugs, we propose a molecular graph feature extraction module (MGFEM) to extract the global features for the molecular graph and the local features for each atom of the molecular graph.
• We conduct extensive experiments on three real datasets with different scales to demonstrate the superiority of our model.

Drug-drug prediction
Drug-Drug prediction has always been a worthy research direction in pharmacology.
Most previous work depended on in vivo and in vitro experiments. However, such experiments do not scale well due to the limitations of the laboratory environment [24]. Subsequently, machine learning was proposed to solve this problem. Similarity-based methods calculated specific similarity measures [25][26][27][28][29], e.g., based on drug structure, targets, side effects, genomic properties and therapeutic properties, and combined them with machine learning models for prediction. Ryu et al. [30] predicted the type of drug-drug interactions using a DNN based on the similarity of the chemical structure of drugs. Graph-based methods predicted drug-drug interactions by learning from the molecular graph [31] or the interaction graph [32]. Shang et al. [33] modeled drugs as nodes and DDIs as links, so that the task becomes a link prediction problem.

Graph neural network
Recently, as a neural network method for the graph domain, the graph neural network (GNN) has received great attention. With the development of GNNs, many variants have emerged [34][35][36]. Rahimi et al. [37] proposed to control the transmission of neighbourhood information through a gating operation. With the increasing popularity of GNNs, researchers are using GNN models for DDIs [38]. For example, Duvenaud et al. [39] used a GNN to perform molecular modeling by extracting molecular circular fingerprints. Lin et al. [40] used a knowledge graph neural network (KGNN) to mine the associated relations of drugs in a knowledge graph to solve the DDI prediction problem. Bai et al. [41] proposed to learn drug feature representations with a Bi-level Graph Neural Network (BI-GNN) to solve biological link prediction tasks. MIRACLE [42] is most relevant to our work.

Preliminaries
We define the drug set as $D = \{d_1, \ldots, d_n\}$ and its corresponding SMILES sequence set as $Q = \{q_1, q_2, \ldots, q_n\}$, where $n$ represents the number of drugs. We define the molecular graph as $\mathcal{G} = (V, E)$, where $V$ and $E$ represent the sets of atoms and chemical bonds, respectively, and the interaction graph as $\mathcal{N} = (\mathcal{G}, \mathcal{L})$, where $\mathcal{L}$ represents the links between drugs. We use $d_h$ to denote the dimension of the atom and chemical bond representations and $d_g$ to denote the dimension of the drug representation.

Problem description
The DDI prediction problem is regarded as a link prediction task on the graph. The interaction graph $\mathcal{N}$ can be represented by an adjacency matrix $A \in \mathbb{R}^{n \times n}$ with each element $a_{ij} \in \{0, 1\}$. Given two drug nodes, the DDI prediction problem is to predict whether there is an interaction between them.
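The link prediction view can be illustrated with a minimal sketch, assuming an undirected interaction graph and a hypothetical list of known interacting pairs. The helper names `build_adjacency` and `candidate_pairs` are illustrative and not part of MFFGNN.

```python
def build_adjacency(n_drugs, interactions):
    """Return a symmetric 0/1 adjacency matrix for an undirected DDI graph."""
    A = [[0] * n_drugs for _ in range(n_drugs)]
    for i, j in interactions:
        if i == j:
            continue  # skip self-pairs in this sketch
        A[i][j] = 1
        A[j][i] = 1
    return A


def candidate_pairs(A):
    """Unordered drug pairs with a_ij = 0, i.e. the links a predictor would score."""
    n = len(A)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if A[i][j] == 0]


known = [(0, 1), (1, 2)]  # hypothetical interacting pairs among 4 drugs
A = build_adjacency(4, known)
unknown = candidate_pairs(A)
```

In this toy graph, `unknown` holds the pairs whose interaction status the model would have to predict.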

Overview of MFFGNN
The framework of MFFGNN is shown in Fig. 1, which is divided into the following four modules. In the Molecular Graph Feature Extraction Module (MGFEM), we use the graph interaction network with the graph warp unit to extract the topological structure features of the drug from a given molecular graph. In the SMILES Sequence Feature Extraction Module (SSFEM), we employ the bi-directional gated recurrent unit to extract the local chemical context from a given SMILES sequence. In the Multi-type Feature Fusion Module (MFFM), we apply a GCN encoder to fuse the intra-drug features and external DDI features to update the drug representation. Finally, we predict the missing interactions in the DDI graph through an MLP.

Molecular graph feature extraction module
The Molecular Graph Feature Extraction Module (MGFEM) is shown in Fig. 2. Molecular graphs are an important representation of drugs. We use the RDKit [43] tool to construct the molecular graph $\mathcal{G}$ from the SMILES sequence. First, we obtain the initial features $v_i^{(in)}$ of each atom according to the atom symbol, formal charge, whether the atom is aromatic, its hybridization, chirality, etc. Similarly, we obtain the initial features $e_{ij}^{(in)}$ of each bond according to the type of bond, whether the bond is in a ring, whether it is conjugated, etc.
Fig. 1 Overview of MFFGNN. The combination operator shown in the figure denotes a sum. The MFFGNN uses SMILES sequences and molecular graphs as inputs to the model, and then extracts the intra-drug features through the MGFEM and SSFEM modules, respectively. Then, MFFGNN fuses the intra-drug features and external DDI features through the MFFM module to obtain the updated drug features. Finally, the final predicted value is obtained by the DDI predictor.

Fig. 2 Overview of MGFEM. Input: molecular graph with initial features. Processing: graph interaction network and graph warp unit. Output: updated atom and supernode features. The MGFEM module applies the graph interaction network and the graph warp unit to extract local information and global information of the molecular graph. When extracting the local information, the module updates the edge features before updating the node features. When extracting the global information, the module utilizes a supernode to promote the global propagation of information.
Then, the initial atom and chemical bond features are transformed to $\mathbb{R}^{d_h}$ through a single-layer neural network, in which ReLU is the activation function and $W_v^{(0)}$ and $W_e^{(0)}$ are the learnable weight matrices. To fully extract atom and chemical bond features, we apply graph interaction networks [22]. In a graph interaction network, the features of edge $e_{ij}$ are first updated according to the features of its connected nodes and of the edge itself; this update uses the concatenation operation $\|$, with $W_e^{(l)}$ and $b_e^{(l)}$ as the learnable weight matrix and bias of the edge update, respectively. Then, the node features are updated according to the features of the node's connected edges and of the node itself, where $N(i)$ represents the neighbors of node $i$.
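The edge-then-node ordering can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact MGFEM update: the ReLU activation in both updates, the sum aggregation over incident edges, the node weights `W_v` and `b_v`, and all toy values are assumptions made for the example.

```python
def relu(x):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, t) for t in x]


def linear(W, x, b):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * t for w, t in zip(row, x)) + bias for row, bias in zip(W, b)]


def interaction_layer(nodes, edges, W_e, b_e, W_v, b_v):
    """One layer: update every edge first, then every node.

    nodes maps atom index -> feature list; edges maps (i, j) -> feature list.
    """
    # Edge update: concatenate both endpoint features with the edge feature.
    new_edges = {
        (i, j): relu(linear(W_e, nodes[i] + nodes[j] + e, b_e))
        for (i, j), e in edges.items()
    }
    new_nodes = {}
    for i, v in nodes.items():
        # Sum the updated features of all edges incident to atom i (assumed aggregation).
        agg = [0.0] * len(b_e)
        for (a, c), e in new_edges.items():
            if i in (a, c):
                agg = [s + t for s, t in zip(agg, e)]
        # Node update: concatenate the atom feature with the aggregated edge features.
        new_nodes[i] = relu(linear(W_v, v + agg, b_v))
    return new_nodes, new_edges


# Toy molecule with three atoms, two bonds and d_h = 2 (all values hypothetical).
nodes = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edges = {(0, 1): [1.0, 0.0], (1, 2): [0.0, 1.0]}
W_e = [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1]]
W_v = [[1, 0, 1, 0], [0, 1, 0, 1]]
new_nodes, new_edges = interaction_layer(nodes, edges, W_e, [0.0, 0.0], W_v, [0.0, 0.0])
```

Because the node update reads `new_edges` rather than `edges`, each atom already sees bond features that were refreshed in the same layer.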
The above processes can only spread the features of atoms and chemical bonds locally, but cannot spread information globally. Therefore, we propose to extract the global features of the molecular graph by applying graph warp unit (GWU) [23]. The properties of the whole drug often influence drug-drug interaction prediction. The GWU consists of three parts: supernode, transmitter and warp gate.
Supernode: We add a supernode to the graph that is connected to every atom in the molecular graph. The initial feature of the supernode, $g^{(0)} \in \mathbb{R}^{d_h}$, is the sum of all atom features. Then, the features of the supernode are updated by a single-layer neural network, where $W_g^{(l)}$ is the learnable weight matrix. Transmitter: The transmitter gathers information from the atoms and the supernode. Before propagating the atom features to the supernode, we need to transform the form of the information. Different atom features have different degrees of importance relative to the global features. Therefore, the transmitter applies the multi-head attention mechanism to aggregate the atom features. In this calculation, the message with subscript $v \to s$ represents the information propagated from each atom to the supernode at the $l$-th layer, $\alpha_{v,i}^{(k,l)}$ represents the significance score of node $i$ at the $k$-th head and the $l$-th layer, $\odot$ represents the element-wise product, and $k = 1, 2, \ldots, K$, where $K$ represents the number of heads. The information propagated from the supernode to each atom at the $l$-th layer is denoted by $g_{s \to v}^{(l)}$.
Warp Gate: The warp gate combines the transmitted information and sets the gating coefficients to control the fusion of information. For each atom, gated interpolation is used to fuse the information from the supernode with the atom's own information. The same interpolation is applied in the other direction, where $\alpha_{i \to s}^{(l)}$ represents the gating coefficient during the transmission from atom to supernode and $g_{i \to s}^{(l)}$ represents the information transmitted to the supernode. Finally, the updated features of each atom and of the supernode are calculated through gated recurrent units (GRU) [44]. By applying this module to the whole dataset, we obtain the feature matrix $G \in \mathbb{R}^{n \times d_g}$ based on the molecular graph.
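Two ingredients of the graph warp unit can be sketched in a few lines: the supernode initialization as a sum of atom features, and gated interpolation. The convex-combination form of the gate and the fixed gate values below are assumptions made for illustration; the attention-based transmitter and the final GRU update are omitted.

```python
def init_supernode(atom_feats):
    """Initial supernode feature: element-wise sum of all atom features."""
    return [sum(column) for column in zip(*atom_feats)]


def gated_interpolation(local, incoming, gate):
    """Blend per dimension: (1 - gate) * local + gate * incoming (assumed form)."""
    return [(1.0 - a) * h + a * m for h, m, a in zip(local, incoming, gate)]


atoms = [[1.0, 2.0], [3.0, 4.0]]  # two atoms with d_h = 2 (hypothetical values)
g0 = init_supernode(atoms)

# A gate of 0 keeps the atom's own value, a gate of 1 takes the supernode's value.
mixed_hard = gated_interpolation(atoms[0], g0, gate=[0.0, 1.0])
mixed_soft = gated_interpolation(atoms[0], g0, gate=[0.5, 0.5])
```

In a trained model the gate would be produced by a learned function of both inputs rather than fixed by hand.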

SMILES sequence feature extraction module
Drugs are commonly represented by SMILES sequences, which are composed of molecular symbols. Like molecular graphs, SMILES sequences contain rich features. The molecular graph of a drug describes how the atoms are connected, while the SMILES sequence provides the functional information of the atoms and long-term dependency representations. To capture the local chemical context in SMILES sequences, we first utilize an embedding method to construct an atomic embedding matrix, and then input it into the Bi-directional Gated Recurrent Unit (BiGRU) neural network to obtain the entire drug representation. The SMILES Sequence Feature Extraction Module (SSFEM) is shown in Fig. 3.
Nowadays, most methods encode SMILES sequences by label or one-hot encoding. However, label and one-hot encoding ignore the context information of the atom. Therefore, to explore the function of the atom in its context, we propose to encode SMILES sequences by an advanced embedding method, Smi2Vec [45]. Specifically, for a SMILES sequence such as $q_1$, we divide it into a series of atomic symbols separated by spaces. Then, we map each atom to an embedding vector according to the pre-trained embedding dictionary. Finally, we aggregate the embedding vectors of the atoms to obtain an embedding matrix $X \in \mathbb{R}^{m \times d_h}$, in which $m$ is the number of atoms and each row is the embedding of an atom.
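The split-then-lookup procedure can be sketched as follows. This is not the Smi2Vec implementation: the regular expression, the toy dictionary and the zero vector for unknown symbols are all assumptions made for the example, and every token (including bond and branch symbols) is treated as one row of $X$.

```python
import re

# Hypothetical tokenizer pattern: two-letter halogens, bracketed atoms, single
# letters, ring-closure digits, then bond and branch symbols.
TOKEN_PATTERN = re.compile(r"Cl|Br|\[[^\]]+\]|[A-Za-z]|\d|[=#()+\-/\\@.%]")


def tokenize(smiles):
    """Split a SMILES string into a list of symbols."""
    return TOKEN_PATTERN.findall(smiles)


def embed(tokens, dictionary, dim):
    """Look up each symbol; symbols missing from the dictionary map to zeros."""
    zero = [0.0] * dim
    return [list(dictionary.get(tok, zero)) for tok in tokens]


# Toy stand-in for a pre-trained embedding dictionary with d_h = 2.
toy_dictionary = {"C": [0.1, 0.2], "O": [0.3, 0.4], "Cl": [0.5, 0.6], "=": [0.7, 0.8]}

tokens = tokenize("CC(=O)Cl")          # acetyl chloride
X = embed(tokens, toy_dictionary, dim=2)
```

Each row of `X` corresponds to one symbol of the sequence, so the number of rows varies from drug to drug.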
We apply a layer of BiGRU [21] on the embedding matrix $X$. BiGRU processes the input data with two GRUs in opposite directions, as shown in Fig. 3. The forward hidden state $\overrightarrow{s_t}$ and the backward hidden state $\overleftarrow{s_t}$ are each computed by a GRU, which represents a non-linear transformation of the input vector. The hidden state $s_t$ at time $t$ can then be expressed as the weighted sum of $\overrightarrow{s_t}$ and $\overleftarrow{s_t}$, that is, $s_t = W_t \overrightarrow{s_t} + V_t \overleftarrow{s_t} + b_t$, where $W_t$ and $V_t$ represent the weights, and $b_t$ represents the bias. Then, we use a fully connected layer as the readout layer to obtain the drug representation. By applying this module to the whole dataset, we obtain the sequence-based feature matrix $S \in \mathbb{R}^{n \times d_g}$.
Note that the BiGRU layer requires a fixed-size input matrix, whereas the length of the SMILES sequence varies. We therefore use the approximate average length of the sequences in the dataset as the fixed length and apply zero-padding and cutting operations.
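The zero-padding and cutting step can be sketched as below. The function name `pad_or_truncate` and the toy matrices are illustrative; padding at the end of the sequence (rather than the start) is an assumption.

```python
def pad_or_truncate(X, length, dim):
    """Cut X to at most `length` rows, then append zero rows up to `length`."""
    rows = [list(r) for r in X[:length]]
    missing = length - len(rows)
    rows.extend([0.0] * dim for _ in range(missing))
    return rows


short = [[1.0, 2.0]]                          # shorter than the fixed length
long = [[float(i), 0.0] for i in range(5)]    # longer than the fixed length

padded = pad_or_truncate(short, length=3, dim=2)
cut = pad_or_truncate(long, length=3, dim=2)
```

Cutting discards the tail of long sequences, which is why the choice of the fixed length affects how much sequence information survives.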

Multi-type feature fusion module
We combine the feature matrices $G$ and $S$ obtained above to obtain the intra-drug features, namely their sum, $H = G + S$. In order to fuse the intra-drug features with the external DDI features, we design a GCN encoder with a gating mechanism. Specifically, we take the intra-drug features as the initial node features in the interaction graph, and then update the node representations with a multi-layer GCN. The Multi-type Feature Fusion Module (MFFM) is shown in Fig. 4. For drug $d_i$, the output of the $r$-th layer is computed with the learnable weight parameter $W_u^r$ and the entries $\tilde{A}_{ij}$ of the normalized adjacency matrix $\tilde{A} = K^{-\frac{1}{2}} (A + I_n) K^{-\frac{1}{2}}$, where $K_{ii} = \sum_j (A + I_n)_{ij}$. We can add multiple GCN layers to expand the neighborhood of label propagation, but this may also increase the amount of noisy information. Meanwhile, neighborhoods of different orders contain different information. Therefore, we utilize the gating mechanism [37] to control how much neighborhood information is passed to the node. In this mechanism, $T(c^{r-1})$ represents the gating weight of the $(r-1)$-th layer, and $(W^{r-1}, b^{r-1})$ are the weight matrix and bias of the $(r-1)$-th layer. After the multi-layer GCN, we finally obtain the feature matrix $Z \in \mathbb{R}^{n \times d_g}$ for drugs in the DDI network.
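The normalized adjacency matrix and the role of the gate can be sketched as follows. The normalization follows the definition of $\tilde{A}$ in the text; the highway-style mixing in `gated_update`, the fixed gate values and the omission of learnable weights and activations are assumptions made to keep the example small.

```python
import math


def normalize_adjacency(A):
    """Return K^(-1/2) (A + I_n) K^(-1/2) for a 0/1 adjacency matrix A."""
    n = len(A)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]  # K_ii = sum_j (A + I_n)_ij
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def propagate(A_norm, H):
    """Aggregate neighbor features: A_norm times H (learnable weights omitted)."""
    dim = len(H[0])
    return [[sum(A_norm[i][k] * H[k][d] for k in range(len(H))) for d in range(dim)]
            for i in range(len(H))]


def gated_update(H_prev, H_new, gate):
    """Highway-style mix per node: gate * H_new + (1 - gate) * H_prev (assumed form)."""
    return [[t * new + (1.0 - t) * old for new, old in zip(row_new, row_old)]
            for row_new, row_old, t in zip(H_new, H_prev, gate)]


A = [[0, 1], [1, 0]]            # two interacting drugs
H = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical intra-drug features
A_norm = normalize_adjacency(A)
H_new = propagate(A_norm, H)
H_out = gated_update(H, H_new, gate=[0.5, 0.5])
```

Without the gate, both drugs end up with identical features after one propagation step; with the gate, part of each drug's own representation is preserved, which is the intuition behind using gating against over-smoothing.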
In addition, inspired by MIRACLE, the module uses a graph contrastive learning approach to balance the information inside and outside of the drug. For the drug $d_i$, we take itself and its first-order neighboring nodes as positive samples $P$ and the nodes not among its first-order neighbors as negative samples $N$. We design a learning objective that makes the external features of drug $d_i$ consistent with the internal features of the positive samples and distinct from the internal features of the negative samples. In this objective, $f_D(\cdot): \mathbb{R}^{d_g} \times \mathbb{R}^{d_g} \to \mathbb{R}$ is the discriminator function, which scores the agreement between its two input vectors. Here we set it to the dot product operation.
Fig. 4 Overview of MFFM. The figure marks the gating weight and its complement (1 - gating). The MFFM takes the intra-drug features as the initial node features in the DDI network, and then updates the node representations with a multi-layer graph convolutional neural network with gating.
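One common way to turn such a discriminator into a trainable objective is a binary cross-entropy over positive and negative pairs. The sketch below uses that form as an assumption; it is not necessarily the exact objective of MFFGNN, and the vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot-product discriminator f_D(u, v)."""
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """log(sigmoid(x)), adequate for the moderate values used here."""
    return -math.log1p(math.exp(-x))


def contrastive_loss(z_i, positives, negatives):
    """Assumed objective: high scores for positives, low scores for negatives."""
    pos = sum(log_sigmoid(dot(z_i, h)) for h in positives)
    # log(1 - sigmoid(x)) equals log_sigmoid(-x)
    neg = sum(log_sigmoid(-dot(z_i, h)) for h in negatives)
    return -(pos + neg) / (len(positives) + len(negatives))


z_i = [1.0, 0.0]  # external feature of drug d_i (hypothetical)
aligned = contrastive_loss(z_i, positives=[[2.0, 0.0]], negatives=[[-2.0, 0.0]])
misaligned = contrastive_loss(z_i, positives=[[-2.0, 0.0]], negatives=[[2.0, 0.0]])
```

The loss is small when the external feature agrees with the internal features of its positive samples and large when the roles are swapped.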

DDI prediction
First, we obtain an interaction link representation by multiplying the two drug representations. Then, we input it into an MLP consisting of two fully connected layers to obtain the prediction score.
Our learning objective is to minimize the distance between the predictions and the true labels, where $y_{ij}$ is the real label for drug pair $(d_i, d_j)$. We then unify the DDI prediction task and the contrastive learning task into a single learning framework, in which a hyper-parameter $\alpha$ controls the magnitude of the contrastive task.
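The prediction head and the joint objective can be sketched as follows. Several details are assumptions made for illustration: the element-wise product as the multiplication, ReLU in the hidden layer, binary cross-entropy as the distance between predictions and labels, and the additive form `prediction_loss + alpha * contrastive_loss`.

```python
import math


def link_representation(z_i, z_j):
    """Element-wise product of two drug representations (assumed multiplication)."""
    return [a * b for a, b in zip(z_i, z_j)]


def mlp_score(x, W1, b1, W2, b2):
    """Two fully connected layers followed by a sigmoid output."""
    hidden = [max(0.0, sum(w * t for w, t in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    logit = sum(w * h for w, h in zip(W2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def bce(y, p):
    """Binary cross-entropy between label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def total_loss(prediction_loss, contrastive_loss, alpha):
    """Joint objective with alpha weighting the contrastive task (assumed form)."""
    return prediction_loss + alpha * contrastive_loss


x = link_representation([1.0, 2.0], [3.0, 4.0])
# Untrained (all-zero) weights give an uninformative score of 0.5.
score = mlp_score(x, W1=[[0.0, 0.0], [0.0, 0.0]], b1=[0.0, 0.0], W2=[0.0, 0.0], b2=0.0)
loss = total_loss(bce(1, score), contrastive_loss=2.0, alpha=0.5)
```

Setting `alpha` to zero would reduce the objective to the prediction loss alone, which corresponds to dropping the contrastive task.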

Results
In this section, we design various experiments to demonstrate the superiority of the model MFFGNN.

Experimental setup
Datasets. To verify the validity of our model on datasets of different scales, we evaluate the proposed model on small, medium, and large datasets. In the small-scale dataset, the number of drugs is relatively small, but fingerprints of all drugs are available. In the medium-scale dataset, although the number of drugs is relatively large, there are only as many labeled DDI links as in the small-scale dataset. In the large-scale dataset, most drugs lack many fingerprints. Detailed information about the datasets can be seen in Table 1.
Note that we removed from the datasets the SMILES sequences from which a molecular graph cannot be constructed.
Baselines. To demonstrate the superiority of our model, we compare MFFGNN with the following state-of-the-art models:
• SSP-MLP [30]: This approach used the names and structural information of drug-drug or drug-food pairs as inputs and applied Structural Similarity Profile (SSP) and MLP for classification. We name this model SSP-MLP.
• Multi-Feature Ensemble [46]: This approach combined multiple types of data and proposed a collective framework. We name this model Ens.
• GCN [48]: This approach applied GCN to perform semi-supervised node classification. We use GCN to extract structural information of drugs for DDI prediction.
• GAT [35]: This approach used GAT to perform the node classification task. We apply GAT to extract drug features in the interaction graph for DDI prediction.
• SEAL-C/AI [49]: This approach performs semi-supervised graph classification tasks from a hierarchical graph perspective. We apply this model to obtain drug features for DDI prediction.
• NFP-GCN [39]: This approach designs a GCN for learning molecular fingerprints. We name this model NFP-GCN.
• MIRACLE [42]: This approach simultaneously learned the inter-view molecular structure information and intra-view interaction information of drugs for DDI prediction.
• MFs [50]: This approach only used molecular fingerprints as input to the DDI network to predict DDIs. We name this model MFs.
• We also consider several multi-type DDI prediction methods and apply them to binary classification tasks, i.e. DPDDI [14], SSI-DDI [18], DDIMDL [7], MUFFIN [20].
Implementation details. For the division of the datasets, we use the same splitting method as MIRACLE [42]. We assign 80% of each dataset to the training set and 20% to the test set, and 20% of the training set is randomly sampled as the validation set. The datasets contain only positive drug pairs. For negative training samples, we select the same number of negative drug pairs [51]. We utilize the Adam [52] optimizer to train the model and Xavier [53] initialization to initialize the model. We utilize the exponential decay method to set the learning rate, where the initial learning rate is 0.0001 and the multiplication factor is 0.96. The model applies a dropout [54] layer to the output of each intermediate layer, where the dropout rate is 0.3. We set the dimension of the atom-level and drug-level representations to 256. We set K = 2 in the multi-head attention mechanism. To evaluate the effectiveness of MFFGNN, we consider three metrics: Area Under the Receiver Operating Characteristic curve (AUROC), Area Under the Precision-Recall Curve (AUPRC) and F1.
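The data preparation and learning-rate schedule described above can be sketched as follows. The helper names, the random seeds, the uniform sampling of negative pairs and the interpretation of the decay as one multiplication per step are assumptions; the toy positive pairs are hypothetical.

```python
import random


def split_pairs(pairs, seed=0):
    """80/20 train/test split, then 20% of the training part as validation."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n_test = len(pairs) // 5
    test, train_full = pairs[:n_test], pairs[n_test:]
    n_val = len(train_full) // 5
    return train_full[n_val:], train_full[:n_val], test


def sample_negatives(n_drugs, positives, k, seed=0):
    """Draw k distinct drug pairs that are not known positives (uniform sampling)."""
    known = {frozenset(p) for p in positives}
    if k > n_drugs * (n_drugs - 1) // 2 - len(known):
        raise ValueError("not enough non-interacting pairs to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < k:
        pair = frozenset(rng.sample(range(n_drugs), 2))
        if pair not in known:
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]


def learning_rate(step, initial=1e-4, factor=0.96):
    """Exponential decay: the rate is multiplied by `factor` at every step."""
    return initial * factor ** step


positives = [(i, i + 1) for i in range(100)]  # hypothetical DDI links over 101 drugs
train, val, test = split_pairs(positives)
negatives = sample_negatives(101, positives, k=len(train))
```

Sampling as many negatives as there are positive training pairs yields a balanced training set, even though the underlying graph contains far more non-interacting pairs.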

Comparison results
To verify the validity of the proposed MFFGNN, we compare MFFGNN with state-of-the-art models for DDI prediction on three datasets with different scales. We report the mean and standard deviation over ten repeated experiments. The best results are highlighted in bold.
Comparison on the ZhangDDI dataset. We compare the MFFGNN model with state-of-the-art models on the ZhangDDI dataset, and the results are shown in Table 2. The results of these baselines are obtained from Table 2 in Ref. [42]. As can be seen, the methods considering multiple features, such as Ens, SEAL-C/AI, NFP-GCN and MIRACLE, perform better than the methods considering only one feature. However, MFFGNN has the best performance. MFFGNN considers not only the topological structure information in molecular graphs and the interaction information between drugs, but also the local chemical context in SMILES sequences. This indicates that multi-type feature fusion can improve the performance of the model.
Comparison on the ChCh-Miner dataset. Because the ChCh-Miner dataset lacks fingerprints and side-effect information, we only compare MFFGNN with the graph-based models, and the results are shown in Table 3. The results of these baselines are obtained from Table 3 in Ref. [42]. As shown in Table 3, MFFGNN outperforms all baselines in all metrics, indicating that MFFGNN remains effective on a dataset with few labeled data. In addition, to analyze the robustness of MFFGNN, we obtain different amounts of labeled training data by adjusting the proportion of the training set on the ChCh-Miner dataset. We compare MFFGNN with other methods, and the results are shown in Fig. 5a. The results show that MFFGNN performs well even with a small amount of labeled data. The reasons could be that (i) our model fuses topological structure, local chemical context and DDI relationships; (ii) our model extracts both the global features of the molecular graph and the local features of its atoms; (iii) our model sets a gating mechanism for each graph convolution layer to prevent over-smoothing when stacking multiple GCN layers.
Comparison on the DeepDDI dataset. To verify the scalability of MFFGNN, we perform comparative experiments on the DeepDDI dataset, and the results are shown in Table 4. Because information may be missing in the large-scale dataset, we only choose the SSP-MLP model. The NFP-GCN model performs worse and is limited by its space requirements, so we omit its experimental results. We use 881-dimensional molecular fingerprints as the initial node features in the DDI graph for DDI prediction. Meanwhile, we reduce the multi-type DDI prediction methods to binary prediction and obtain their binary results on the DeepDDI dataset. As shown in Table 4, MFFGNN has high AUROC, AUPRC and F1. The MFs model, which contains only one drug feature, is relatively poor in all metrics. A single feature cannot comprehensively represent drug information, which ultimately affects the prediction results. In contrast, MFFGNN integrates the features from drug sequences and molecular graphs as input to the DDI graph, so that more comprehensive drug information can be learned. Although the SSI-DDI and MIRACLE models have a higher AUROC than MFFGNN, MFFGNN has the highest AUPRC and F1 values. In general, the AUPRC metric is more important than the AUROC metric, because it penalizes false positive DDIs better. F1 focuses on the proportion of DDIs that are correctly predicted. The imbalance of the data in the DeepDDI dataset may have a negative impact on the AUROC of our model. However, this does not affect the overall performance of MFFGNN.
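The different sensitivity of AUROC and AUPRC to false positives on imbalanced data can be illustrated with a small sketch. The pairwise AUROC and the average-precision estimate of AUPRC below are standard definitions written out by hand; the labels and scores are hypothetical and unrelated to the reported experiments.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision at the rank of each positive (an AUPRC estimate)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Two true DDIs among ten pairs; one non-interacting pair receives a high score.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.4]

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
```

A single highly ranked false positive barely moves AUROC in this toy case, because the many easy negatives dominate the pairwise comparison, while average precision drops more noticeably.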
Cross-dataset evaluations. To further evaluate the generalization performance of MFFGNN, we perform cross-dataset evaluations. One dataset serves as the training set, while the other two serve as test sets. Because of the poor performance of the other methods, we compare MFFGNN with three methods, namely GAT, SEAL-C/AI and MIRACLE, and the results are shown in Fig. 6. As shown in the figure, MFFGNN outperforms the other methods in AUROC, AUPRC and F1. These results show that our model can predict drug-drug interactions with steady accuracy, independent of the scale of the datasets, and confirm that MFFGNN has good generalization performance.

Ablation study
In order to verify the validity of each type of drug feature, we carry out DDI prediction using each type of feature or combination of features on the ChCh-Miner dataset. The experimental results are shown in Table 5. The best results are highlighted in bold. As shown in Table 5, deleting any one of the three types of features damages performance. The performance is best when the three types of features are considered simultaneously. In addition, among single features, the model performs best when considering only the interaction information between drugs or only the topological information of the molecular graph. Among pairwise feature combinations, considering the interaction information between drugs and the topological information of the molecular graph simultaneously performs best, and pairwise feature combinations significantly improve performance over single features. This suggests that multi-feature integration can better represent drugs and improve prediction results. Our model considers the global features of the molecular graph and the local features of its atoms. In order to study the effectiveness of this design, we design a variant, namely -GWU, which ignores the global information in molecular graphs. As shown in Table 6, deleting the global features damages performance. To study the validity of contrastive learning, we design a variant called -Contrastive, which removes the contrastive learning from the framework. As shown in Table 6, -Contrastive is inferior to MFFGNN in all metrics. The results show that contrastive learning benefits drug feature learning.
MFFGNN contains a GCN encoder with a gating mechanism to fully utilize the neighborhood information of different orders. In order to study its effectiveness, we conduct a comparative experiment with and without gating, and the results are shown in Table 6. The performance of the model without gating is lower than that of the model with gating. This indicates that a GCN encoder with gating is beneficial for predicting DDIs. Fig. 5b shows the effectiveness of each component of the proposed MFFGNN.

Parameter analysis
In this section, we analyze several key parameters of the model by performing experiments on the ZhangDDI dataset, including $\alpha$ in the objective function of our model, the dimensionality of the drug representation $d_g$, the sequence length $L_s$, the learning rate $l_r$, the number of GCN layers $L_m$ and the number of heads $k$ of the $k$-head attention in the MGFEM module. We study the influence of each key parameter on MFFGNN while fixing the other parameters.
To study the optimal setting of $\alpha$ in the objective function of our model, we vary $\alpha$ from 0.1 to 1.0 while fixing the other parameters; the results are shown in Fig. 7a. We observe that the three metrics are optimal when $\alpha$ is set to 0.9. On the whole, the fact that the best value of $\alpha$ is nonzero supports the importance of contrastive learning in the model.
When training the BiGRU, we need to input a fixed-size matrix. However, the length of SMILES sequences varies. Therefore, we fix the length of the input sequence at some value and apply zero-padding and cutting operations. To study the optimal setting of the sequence length, we vary $L_s$ from 50 to 250 while fixing the other parameters; the results are shown in Fig. 7b. Because most of the SMILES sequences in the dataset are between 100 and 150 symbols long, the model performance is optimal when $L_s = 150$. When $L_s = 150$, most of the sequences do not need to be cut, and little information is lost. When $L_s = 100$, however, most of the sequences lose information, and the performance is low. When the sequence length is greater than 150, zero-padding is applied, but the performance degradation is likely to be negligible because the input still contains enough sequence information.
To study the optimal setting of $d_g$, we vary it from 2 to 1024 while fixing the other parameters, and the results are shown in Fig. 7c. When $d_g$ is set to 256, the three metrics are optimal, and the model achieves the best performance. Specifically, as the dimensionality of the drug representation increases, MFFGNN can extract more useful information. However, too high a dimensionality may increase noise and lead to performance degradation. Similarly, to study the optimal setting of $l_r$, we choose $l_r$ from {0.01, 0.001, 0.0001, 0.00001} while fixing the other parameters; the results are shown in Fig. 7d. When $l_r = 0.0001$, the model performance is best.
To study the optimal settings of $L_m$ and of $k$ in the $k$-head attention of the MGFEM module, we vary each of them from 1 to 4 while fixing the other parameters; the results are shown in Fig. 7e, f. For the $k$-head attention, the model performance is best when $k = 2$. As seen from the figure, the MFFGNN performance improves as $L_m$ increases up to 3. When $L_m = 3$, the three metrics are optimal and the model achieves the best performance. However, too many layers may cause overfitting and lead to performance degradation.

Discussions
Drug-drug interaction prediction has always been a worthy research direction in pharmacology. Most of the existing methods for predicting drug-drug interactions consider only a single drug feature. However, a single drug feature cannot comprehensively represent drug information, which ultimately affects the prediction results. Our proposed model takes into account not only the topological structure information in molecular graphs and the interaction information between drugs, but also the local chemical context in SMILES sequences. Multiple drug features represent the drug information more comprehensively. We perform DDI prediction using each type of feature or combination of features, and the experimental results are shown in Table 5. When the three types of features are considered simultaneously, the model has the best performance. When extracting information from the molecular graph, we extract the local features of the atoms and the global features of the whole molecular graph. This facilitates the long-range propagation of information in the graph. We demonstrate the importance of the global features of the molecular graphs in the ablation experiments, and the results are given in Table 6. In addition, to verify that MFFGNN has good generalization performance, we perform cross-dataset evaluations, and the results are given in Fig. 6. As shown in the figure, our model can predict drug-drug interactions with stable accuracy, regardless of the scale of the dataset. However, our model also has some limitations; for example, it does not extend to multi-type DDI prediction tasks. In future work, we will further generalize the model to predict multi-type DDI events.

Conclusions
In this paper, we propose a novel end-to-end learning framework for DDI prediction, namely MFFGNN, which can efficiently fuse the information from drug molecular graphs, SMILES sequences and DDI graphs. The MFFGNN model utilizes the molecular graph feature extraction module to extract global and local features in molecular graphs. Moreover, in the multi-type feature fusion module, we set up the gating mechanism to control how much neighborhood information is passed to the node. We perform extensive experiments on multiple real datasets. The results show that the MFFGNN model consistently outperforms other state-of-the-art models.